In computer vision, automatic facial expression recognition (FER) remains
a difficult and interesting topic. The majority of existing techniques are based
on traditional feature descriptors such as the local binary pattern (LBP) and
the histogram of oriented gradients (HOG), in which the classifier's
hyperparameters are tuned to produce the best recognition accuracy on
a single database or a small set of similar databases. This paper integrates the
power of deep learning techniques with LBP and HOG. The LBP and
HOG features are computed from each image in the dataset. The resulting dataset is
fed to a convolutional neural network (CNN). The architecture of this
CNN consists of three convolutional layers and three max-pooling layers. The
output layers comprise batch normalization, three dense layers, and two dropout
layers. The proposed architecture is validated on the extended Cohn-Kanade
dataset (CK+). The accuracy of the CNN model improves
from 0.9593 to 0.9675 and 0.9756 after using LBP and HOG, respectively.

Keywords: CNN, FER, HOG, LBP

1. INTRODUCTION

Automatic facial expression recognition (FER) is a difficult but fascinating topic with vital applications
in a wide range of fields, including human-computer interaction (HCI). FER is gaining more and more
attention because it is extensively used in many domains such as security [1],
health care [2], human-robot interaction [3], smart living [4], driver safety, animation [5], and
e-learning. The primary goal of FER is to classify facial expressions into a set of emotions, including surprise,
contempt, sadness, disgust, anger, fear, and happiness. The local binary pattern (LBP) is a strong feature for
representing texture: the activity of the facial muscles produces varied textures when a facial
expression appears. The histogram of oriented gradients (HOG) is employed to extract gradient-based features
from the image. In FER, HOG and LBP have proven to be effective descriptors and give good results [6].

Deep learning techniques have been extensively employed and shown to be effective in a variety of
application domains. The convolutional neural network (CNN) is a well-known deep learning architecture that
learns directly from the input without the need for hand-crafted feature extraction; this property gives it a
competitive advantage over traditional networks. Recently, CNNs have proven effective for FER [7], [8]. A
conventional CNN model consists of several layers, each of which performs a specific function. The convolution
layer is responsible for extracting features, the pooling layer for reducing the size of the preceding layer's
feature maps, and the fully-connected layer (dense layer) for eliciting high-level features and predicting the
model's output. In addition, CNNs use activation functions such as ReLU, sigmoid, and tanh.

In this paper, we implement a new CNN model to solve the FER problem. HOG and LBP are
employed with the proposed model to improve its accuracy on the extended Cohn-Kanade (CK+) dataset.
Oliver et al. [9] suggested a way to decrease false positives in the recognition of mammography images
employing LBP. The results show that LBP features are efficient at reducing false positives across a wide range
of mass sizes and that LBP outperformed the methods it was compared with. Khandait et al. [10] found that the
height and width of facial regions, which depend on the elements of the face and the movements of the
muscles, are salient features for facial expression recognition. Their method performed well, with an
accuracy of 95.26% on the JAFFE dataset. Liu et al. [11] suggested a deep architecture called
“AU-Aware” comprising three sequential modules. The experiments employed three datasets: MMI, CK+, and
SFEW. The results demonstrate that the features produced by “AU-Aware” are competitive with
HOG, SIFT, Gabor, and LBP features. Liu et al. [12] utilized principal component analysis (PCA) to
reduce the dimensionality of the large number of features obtained by combining HOG and LBP. The results
gave good performance on the JAFFE and CK+ datasets.

Kumar et al. [13] published a comparative study of LBP, deep features, and bag-of-visual-words
(BoVW) for the classification of histopathological images. The obtained accuracy was 90.62% using LBP
and 94.72% using deep features, while BoVW gave the best accuracy of 96.50%. Alhindi et al. [14] compared
three classification models using the following feature extractors: HOG, LBP, and deep features from
a pre-trained model (VGG19). The experiments were performed on the KIMIA Path960 dataset. The results gave
an accuracy of 90.52% using LBP. Nigam et al. [15] implemented feature extraction by computing the HOG
feature in the discrete wavelet transform (DWT) domain and employed an SVM for expression recognition. The
suggested method was evaluated on three datasets: JAFFE, CK+, and Yale face. The results indicate that the
suggested approach is efficient for FER and outperforms the existing approaches it was compared with.

Xie et al. [16] incorporated scattered features into deep feature learning to improve the
generalization ability of a CNN for recognizing facial emotions. The results revealed that the suggested method
performed well on four datasets: CK+, Oulu-CASIA, FER2013, and MMI. Zhang et al. [17] utilized the
shape geometry of the face image by suggesting an end-to-end deep learning model. The suggested model
depends on a generative adversarial network (GAN) and employs three datasets: Multi-PIE, BU-3DFE, and
SFEW. The results on the three datasets demonstrate the effectiveness of the model. Sharifnejad et al. [18]
compared the performance of HOG, LBP, and their combination on
several regions of a face image. The results produced an accuracy of 95.33% using three regions of the face
(the eyes, mouth, and nose) on the CK dataset.


2. METHOD 

In this work, we propose a new CNN model and employ HOG and LBP to improve its accuracy.
The model contains three convolutional layers and three max-pooling layers, followed by output
layers comprising BatchNormalization, three dense layers, and two dropout layers. The input of the model is
224×224 pixels. The convolutional layers employ 32, 64, and 128 filters of size 3×3 with ‘same’ padding and
the ReLU activation function. The max-pooling layers employ kernels of size 2×2. The first two dense layers
use the ReLU activation, while the last dense layer uses the softmax function. The three dense layers have
128, 64, and 7 neurons, respectively, and the dropout rate is 0.5. Figure 1 illustrates the stages of the
proposed system for recognizing facial expressions.
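
To make the layer dimensions concrete, the following sketch traces the feature-map shapes and weight counts implied by the stated layer sizes. It is an illustration only: the single-channel input, the flatten step before the first dense layer, and the helper names are assumptions, and BatchNormalization and dropout are left out of the trace because the text does not state where they sit.

```python
def conv_same(shape, filters, kernel=3):
    """'same' padding keeps height and width; returns (output shape, parameter count)."""
    h, w, c = shape
    params = kernel * kernel * c * filters + filters  # weights + biases
    return (h, w, filters), params


def max_pool(shape, size=2):
    """Non-overlapping pooling divides height and width by the pool size."""
    h, w, c = shape
    return (h // size, w // size, c)


def dense_params(units_in, units_out):
    """Fully connected layer: one weight per input-output pair plus biases."""
    return units_in * units_out + units_out


def trace_model(input_shape=(224, 224, 1)):  # one input channel is an assumption
    shape = input_shape
    rows = []
    for filters in (32, 64, 128):
        shape, params = conv_same(shape, filters)
        rows.append((f"conv3x3-{filters}", shape, params))
        shape = max_pool(shape)
        rows.append(("maxpool2x2", shape, 0))
    flat = shape[0] * shape[1] * shape[2]  # assumed flatten before the dense layers
    rows.append(("flatten", (flat,), 0))
    units_in = flat
    for units in (128, 64, 7):
        rows.append((f"dense-{units}", (units,), dense_params(units_in, units)))
        units_in = units
    return rows


if __name__ == "__main__":
    for name, shape, params in trace_model():
        print(f"{name:<14} {str(shape):<18} {params:>12,}")
```

Under these assumptions, the three pooling stages reduce the 224×224 input to 28×28×128 feature maps, so the first dense layer holds by far the most weights in the network.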


[Figure 1 diagram: dataset loading, preprocessing, HOG/LBP extraction, training/testing division (training set and testing set), CNN training/testing process]

Figure 1. The proposed FER system 


The first stage is preprocessing, in which each image in our selected FER dataset
(CK+) is resized to 224×224 pixels. Preprocessing is performed before the feature extraction procedure.
The second stage extracts the HOG or LBP features that are employed in the proposed model. In the third stage,
we train and test three models separately: the proposed CNN, LBP with CNN, and HOG with CNN. The fourth
stage computes the classification results, which involves assigning each image in the testing set to
one of seven expressions (surprise, fear, contempt, disgust, angry, happy, and sad).

HOG: It describes features by counting the occurrences of gradient orientations in a certain
area of an image. HOG divides the image into cells and computes the gradients over them. It was
suggested by Dalal and Triggs [19]. Let I denote the intensity (grayscale) function that describes
the image to be analyzed. The image is partitioned into cells of size K×K pixels (as depicted in
Figure 2(a)). The gradient along the x-axis is $G_x = I(x+1, y) - I(x-1, y)$ and along the y-axis is
$G_y = I(x, y+1) - I(x, y-1)$. The magnitude of the gradient is calculated as:

$$M(x, y) = \sqrt{G_x^2 + G_y^2} \qquad (1)$$

and the gradient orientation $\theta(x, y)$ is calculated at each pixel (see Figures 2(b) and (c)) as:

$$\theta(x, y) = \tan^{-1}\left(\frac{G_y}{G_x}\right) \qquad (2)$$

An M-bin histogram of orientations is created by computing the orientation of all pixels and
accumulating them (see Figures 2(d) and (e)). To generate the final feature vector, all cell histograms are
concatenated (see Figure 2(f)) [20]. HOG is widely utilized in image processing and computer vision and gives
good results for feature detection in FER [21].

Figure 2. The process of extracting HOG features [20]: (a) original image, (b) image cell, (c) gradient
orientation, (d) orientation accumulation, (e) cell histogram, and (f) histogram concatenation
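
The cell-level computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in the paper: the cell size of 8 pixels, the 9 unsigned orientation bins, the magnitude-weighted voting, and the function name `hog_cells` are all assumptions, and the block normalization of the full Dalal and Triggs descriptor is omitted.

```python
import numpy as np


def hog_cells(image, cell=8, bins=9):
    """Per-cell orientation histograms, concatenated into one feature vector."""
    img = image.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Central differences; border pixels keep a zero gradient.
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # G_x = I(x+1, y) - I(x-1, y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # G_y = I(x, y+1) - I(x, y-1)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)   # Eq. (1)
    # Eq. (2), computed with arctan2 to avoid division by zero and folded
    # into unsigned orientations in [0, 180) degrees (an assumption).
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    features = []
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            # Each pixel votes for its orientation bin with its gradient magnitude.
            hist, _ = np.histogram(orientation[window], bins=bins,
                                   range=(0.0, 180.0), weights=magnitude[window])
            features.append(hist)
    return np.concatenate(features)


if __name__ == "__main__":
    ramp = np.tile(np.arange(16, dtype=float), (16, 1))  # intensity grows along x
    print(hog_cells(ramp).shape)
```

For a 224×224 image with these assumed settings, the vector has 28 × 28 × 9 entries, one 9-bin histogram per cell.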


LBP: It is a texture descriptor, which makes it a useful feature for image texture classification [22]. It was
presented by Ojala et al. [23]. LBP labels the pixels of an image by thresholding the neighborhood of each pixel
against the central value and treating the result as a binary number. Concatenating all the binary codes in a
clockwise direction, starting with the top-left one, yields a binary number, and the associated decimal value is
employed for labeling [24]. In decimal form, the resulting LBP code is:

$$LBP_{m,b} = \sum_{n=0}^{m-1} w(j_n - j_c)\, 2^n \qquad (3)$$

where $j_n$ represents the intensity value of the n-th neighboring pixel and $j_c$ represents the intensity value
of the central pixel, m represents the number of pixels in the circular neighborhood, and b represents the radius
of the circular neighborhood. The threshold function is defined as:

$$w(u) = \begin{cases} 1, & u \geq 0 \\ 0, & u < 0 \end{cases} \qquad (4)$$

The histogram of these codes is then employed for further analysis. Figure 3 shows the LBP encoding
method. First, the image is encoded as an LBP image and then separated into patches, with one LBP histogram
derived from each patch. The final feature vector is created by concatenating the LBP histograms of all patches
[25]. LBP can be employed to describe facial expressions and gives good results in face
recognition [26].

Figure 3. The LBP encoding method [25]
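
As an illustration of the encoding and the patch histograms, the sketch below computes LBP codes for the simplest case of a 3×3 neighborhood (m = 8, b = 1). The bit order (the top-left neighbor contributes the lowest bit), the 8-pixel patch size, and the function names are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

# Clockwise 8-neighbor offsets (dy, dx), starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_image(image):
    """LBP code of every interior pixel for a 3x3 neighborhood (m = 8, b = 1)."""
    img = image.astype(np.int64)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros_like(center)
    for n, (dy, dx) in enumerate(OFFSETS):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Threshold function w(j_n - j_c), weighted by 2^n as in the LBP sum.
        codes += (neighbor >= center).astype(np.int64) << n
    return codes


def lbp_histogram(image, patch=8):
    """Concatenate one 256-bin histogram of LBP codes per patch."""
    codes = lbp_image(image)
    rows, cols = codes.shape[0] // patch, codes.shape[1] // patch
    hists = []
    for r in range(rows):
        for c in range(cols):
            block = codes[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            hists.append(np.bincount(block.ravel(), minlength=256))
    return np.concatenate(hists)


if __name__ == "__main__":
    sample = np.array([[9, 1, 1],
                       [1, 5, 1],
                       [1, 1, 1]])
    print(lbp_image(sample))  # only the top-left neighbor passes the threshold
```

A circular neighborhood with a larger radius would need interpolated sampling points, which this sketch does not cover.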


3. RESULTS AND DISCUSSION 
3.1. Database 

CK+: The extended Cohn-Kanade database involves 593 image sequences from 123 subjects, of which
327 sequences are labeled with one of the seven main facial expressions (surprise, fear, contempt, disgust, angry,
happy, and sad). This dataset was recorded under controlled conditions in the laboratory. The most typical data
selection strategy for static-based methods extracts the final one to three frames of each sequence, which show
the peak expression, together with the first frame (neutral face) [27]. Figure 4 depicts a set of training samples
chosen from the CK+ dataset.

Figure 4. A set of training samples chosen from the CK+ dataset


3.2. Experimental results 

This section studies the effect of using feature descriptors (LBP and HOG) on the accuracy
of FER when used with the proposed model. To this end, we evaluate three models: CNN
alone, LBP+CNN, and HOG+CNN. The test accuracy of the proposed CNN model alone reaches 0.9593.
We then trained the CNN with LBP features and obtained an accuracy of 0.9675. In the third experiment,
we combined HOG features with the CNN, and the accuracy further increased to 0.9756. The validation accuracy
and loss are outlined in Table 1.
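
One way to read these accuracies is as a reduction in error rate relative to the plain CNN. The short calculation below uses only the three reported values; the helper names are hypothetical, and because the figures come from a single train/test split, the outcome should be taken as indicative rather than statistically established.

```python
accuracy = {"CNN": 0.9593, "LBP+CNN": 0.9675, "HOG+CNN": 0.9756}


def error_rate(acc):
    return 1.0 - acc


def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed by the new model."""
    baseline_err = error_rate(baseline_acc)
    return (baseline_err - error_rate(new_acc)) / baseline_err


if __name__ == "__main__":
    for name in ("LBP+CNN", "HOG+CNN"):
        gain = accuracy[name] - accuracy["CNN"]
        reduction = relative_error_reduction(accuracy["CNN"], accuracy[name])
        print(f"{name}: +{gain:.4f} absolute, {reduction:.1%} fewer errors")
```

Measured this way, LBP removes roughly a fifth of the baseline's errors and HOG roughly two fifths.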


Table 1. The validation accuracy and loss for three models 


Model            Val Accuracy    Val Loss
CNN model        0.9593          0.8281
LBP+CNN model    0.9675          0.1154
HOG+CNN model    0.9756          0.7213

The validation loss of the LBP+CNN model is lower than that of the HOG+CNN model, as shown in Table 1,
even though HOG+CNN attains the higher accuracy. We trained the proposed model with the Adam optimization
algorithm, splitting the CK+ dataset into 75% for training and 25% for testing. The other parameters employed
in the training process are listed in Table 2. The accuracy and loss curves of the CNN, LBP+CNN, and HOG+CNN
models are shown in Figures 5(a), (b), and (c) (in the appendix), respectively. Furthermore, the confusion
matrices of the CNN, LBP+CNN, and HOG+CNN models are depicted in Figures 6(a), (b), and (c), respectively.
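
For readers who want to reproduce this kind of analysis, a row-normalized confusion matrix can be computed from true and predicted class indices as sketched below. The class order follows the labels shown in Figure 6; the mapping of classes to integer indices, the function name, and the toy labels in the usage example are assumptions.

```python
import numpy as np

CLASSES = ["anger", "fear", "surprise", "contempt", "disgust", "happy", "sadness"]


def confusion_matrix(y_true, y_pred, num_classes=len(CLASSES), normalize=True):
    """Rows are true classes, columns are predicted classes."""
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly when the same (true, pred) pair repeats.
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    if not normalize:
        return counts
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no samples stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums,
                     out=np.zeros(counts.shape), where=row_sums > 0)


if __name__ == "__main__":
    y_true = [0, 0, 3, 3, 3, 3]   # toy labels: two anger, four contempt
    y_pred = [0, 0, 3, 3, 0, 1]
    print(np.round(confusion_matrix(y_true, y_pred), 2))
```

With row normalization, each diagonal entry is the recall of that class, and the off-diagonal entries show which classes it is mistaken for.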


Table 2. The parameters that the model employed for training 


Parameter        Value
Epochs           25
Optimizer        Adam
Batch size       64
Learning rate    0.0001

[Figure 6 panels (a), (b), and (c): confusion matrices over the classes anger, fear, surprise, contempt, disgust, happy, and sadness; horizontal axis: predicted label]


Figure 6. The confusion matrices of the three models: (a) CNN model, (b) LBP+CNN model, and
(c) HOG+CNN model

4. CONCLUSION

In this work, we presented a new neural network architecture to recognize facial expressions. In
addition, we used the HOG and LBP techniques with the CNN to increase the accuracy of the proposed model. The
initial phase in our system is feature extraction. LBP is known as an efficient texture descriptor for analyzing
facial images, while HOG captures gradient-based features from a grayscale image. We
trained three models: CNN, LBP+CNN, and HOG+CNN. All the assessments were implemented on the CK+
dataset. The HOG+CNN model achieved the highest accuracy of 0.9756, while the other models also achieved good
accuracy: 0.9675 for LBP+CNN and 0.9593 for the new CNN model alone. The results are preliminary and may
be improved in future work that looks into additional factors that
may influence accuracy.

Figure 5. The development of the model performance as the epochs progress in the three models: (a) CNN
model, (b) LBP+CNN model, and (c) HOG+CNN model